Correcting the optimally selected resampling-based error rate: A smooth analytical alternative to nested cross-validation
Abstract
High-dimensional binary classification tasks, e.g. the classification of microarray samples into normal and cancer tissues, usually involve a tuning parameter that adjusts the complexity of the applied method to the examined data set. When only the performance of the best tuning parameter value is reported, over-optimistic prediction errors are published. The contribution of this paper is twofold. Firstly, we develop a new method for tuning bias correction that can be motivated by decision-theoretic considerations. The method is based on a decomposition of the unconditional error rate that accounts for the tuning procedure. Our corrected error estimator can be written as a weighted mean of the errors obtained using the different tuning parameter values. It can be interpreted as a smooth version of nested cross-validation (NCV), the standard approach for avoiding tuning bias. In contrast to NCV, the weighting scheme of our method guarantees intuitive bounds for the corrected error. Secondly, we suggest using bias correction methods also to address the bias resulting from the optimal choice of the classification method among several competitors. This method selection bias is particularly relevant to prediction problems in high-dimensional data. In the absence of standards, it is common practice to try several methods successively, which can lead to an optimistic bias similar to the tuning bias. We demonstrate the performance of our method on both types of bias using microarray data sets and compare it to existing methods. This study confirms that our approach yields estimates competitive with NCV at a much lower computational price.
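The contrast between the naive "report the best" estimate and a weighted-mean correction can be sketched numerically. The snippet below is a toy illustration on simulated error rates, not the paper's estimator: the softmax-style weights and the temperature `tau` are assumptions chosen only to show the mechanics, whereas the paper derives its weights from a decision-theoretic decomposition of the unconditional error rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cross-validated error rates: rows = CV repetitions,
# columns = candidate tuning parameter values (simulated data).
cv_errors = rng.uniform(0.1, 0.4, size=(20, 5))

# Mean error per tuning parameter value.
mean_errors = cv_errors.mean(axis=0)

# Naive estimate: report the best value's error (optimistically biased).
naive = float(mean_errors.min())

# Smooth correction: a weighted mean over all tuning values, with weights
# favoring better-performing values. A softmax on the negative error is
# one plausible weighting; it is NOT the paper's weighting scheme.
tau = 0.05  # temperature: smaller -> closer to the naive minimum
w = np.exp(-mean_errors / tau)
w /= w.sum()
corrected = float(w @ mean_errors)

# Because the weights are non-negative and sum to one, the corrected
# estimate lies between the best and worst candidate errors -- the
# intuitive bounds that the abstract attributes to the weighted mean.
assert mean_errors.min() <= corrected <= mean_errors.max()
```

Nested cross-validation reaches a similar correction by rerunning the full tuning loop inside an outer CV layer; the weighted mean avoids that inner loop, which is the source of the computational savings claimed in the abstract.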
Similar resources
A general, prediction error-based criterion for selecting model complexity for high-dimensional survival models.
When fitting predictive survival models to high-dimensional data, an adequate criterion for selecting model complexity is needed to avoid overfitting. The complexity parameter is typically selected by the predictive partial log-likelihood (PLL) estimated via cross-validation. As an alternative criterion, we propose a relative version of the integrated prediction error curve (IPEC), which can be...
Feature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine
Different approaches have been proposed for feature selection to obtain a suitable feature subset from among all features. These methods search the feature space for feature subsets that satisfy some criteria or optimize several objective functions. The objective functions are divided into two main groups: filter and wrapper methods. In filter methods, feature subsets are selected according to some measu...
Cross-validation and Bootstrap for Optimizing the Number of Units of Competitive Associative Nets
Cross-validation and bootstrap, and resampling methods in general, are examined for optimizing the number of units of competitive associative nets called CAN2. A number of resampling methods are available, but their performance depends on the neural network to be applied and the functions to be learned. We therefore apply several resampling methods to the CAN2, which has been shown eff...
Fast robust estimation of prediction error based on resampling
Robust estimators of the prediction error of a linear model are proposed. The estimators are based on the resampling techniques cross-validation and bootstrap. The robustness of the prediction error estimators is obtained by robustly estimating the regression parameters of the linear model and by trimming the largest prediction errors. To avoid the recalculation of time-consuming robust regressi...
A simulation study of cross-validation for selecting an optimal cutpoint in univariate survival analysis.
Continuous measurements are often dichotomized for the classification of subjects. This paper evaluates two procedures for determining the best cutpoint for a continuous prognostic factor with right-censored outcome data. One procedure selects the cutpoint that minimizes the significance level of a logrank test comparing the two groups defined by the cutpoint. This procedure adjusts the sign...